-
-
-
- FLEXDOC(1) FLEXDOC(1)
-
-
- NAME
- flexdoc - documentation for flex, fast lexical analyzer
- generator
-
- SYNOPSIS
- flex [-bcdfhilnpstvwBFILTV78+ -C[aefFmr] -Pprefix -Sskeleton]
- [filename ...]
-
- DESCRIPTION
- flex is a tool for generating scanners: programs which
- recognize lexical patterns in text. flex reads the given
- input files, or its standard input if no file names are
- given, for a description of a scanner to generate. The
- description is in the form of pairs of regular expressions
- and C code, called rules. flex generates as output a C
- source file, lex.yy.c, which defines a routine yylex().
- This file is compiled and linked with the -lfl library to
- produce an executable. When the executable is run, it
- analyzes its input for occurrences of the regular expressions.
- Whenever it finds one, it executes the corresponding
- C code.
-
- SOME SIMPLE EXAMPLES
- First some simple examples to get the flavor of how one
- uses flex. The following flex input specifies a scanner
- which, whenever it encounters the string "username", will
- replace it with the user's login name:
-
- %%
- username printf( "%s", getlogin() );
-
- By default, any text not matched by a flex scanner is
- copied to the output, so the net effect of this scanner is
- to copy its input file to its output with each occurrence
- of "username" expanded. In this input, there is just one
- rule. "username" is the pattern and the "printf" is the
- action. The "%%" marks the beginning of the rules.
-
- Here's another simple example:
-
-     int num_lines = 0, num_chars = 0;
-
- %%
- \n ++num_lines; ++num_chars;
- . ++num_chars;
-
- %%
- int main( void )
- {
-     yylex();
-     printf( "# of lines = %d, # of chars = %d\n",
-             num_lines, num_chars );
-     return 0;
- }
-
-
-
-
- Version 2.4 November 1993 1
-
- This scanner counts the number of characters and the num-
- ber of lines in its input (it produces no output other
- than the final report on the counts). The first line
- declares two globals, "num_lines" and "num_chars", which
- are accessible both inside yylex() and in the main() routine
- declared after the second "%%". There are two rules,
- one which matches a newline ("\n") and increments both the
- line count and the character count, and one which matches
- any character other than a newline (indicated by the "."
- regular expression).
-
- A somewhat more complicated example:
-
- /* scanner for a toy Pascal-like language */
-
- %{
- /* need this for the calls to atoi() and atof() below */
- #include <stdlib.h>
- %}
-
- DIGIT [0-9]
- ID [a-z][a-z0-9]*
-
- %%
-
- {DIGIT}+    {
-             printf( "An integer: %s (%d)\n", yytext,
-                     atoi( yytext ) );
-             }
-
- {DIGIT}+"."{DIGIT}*        {
-             printf( "A float: %s (%g)\n", yytext,
-                     atof( yytext ) );
-             }
-
- if|then|begin|end|procedure|function        {
-             printf( "A keyword: %s\n", yytext );
-             }
-
- {ID}        printf( "An identifier: %s\n", yytext );
-
- "+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );
-
- "{"[^}\n]*"}"     /* eat up one-line comments */
-
- [ \t\n]+          /* eat up whitespace */
-
- .           printf( "Unrecognized character: %s\n", yytext );
-
- %%
-
- int main( int argc, char **argv )
- {
-     ++argv, --argc;  /* skip over program name */
-     if ( argc > 0 )
-         yyin = fopen( argv[0], "r" );
-     else
-         yyin = stdin;
-
-     yylex();
-     return 0;
- }
-
- This is the beginning of a simple scanner for a language
- like Pascal. It identifies different types of tokens and
- reports on what it has seen.
-
- The details of this example will be explained in the fol-
- lowing sections.
-
- FORMAT OF THE INPUT FILE
- The flex input file consists of three sections, separated
- by a line with just %% in it:
-
- definitions
- %%
- rules
- %%
- user code
-
- The definitions section contains declarations of simple
- name definitions to simplify the scanner specification,
- and declarations of start conditions, which are explained
- in a later section.
-
- Name definitions have the form:
-
- name definition
-
- The "name" is a word beginning with a letter or an underscore
- ('_') followed by zero or more letters, digits, '_',
- or '-' (dash). The definition is taken to begin at the
- first non-white-space character following the name and
- to continue to the end of the line. The definition can
- subsequently be referred to using "{name}", which will
- expand to "(definition)". For example,
-
- DIGIT [0-9]
- ID [a-z][a-z0-9]*
-
- defines "DIGIT" to be a regular expression which matches a
- single digit, and "ID" to be a regular expression which
- matches a letter followed by zero-or-more letters-or-
- digits. A subsequent reference to
-
- {DIGIT}+"."{DIGIT}*
-
- is identical to
-
- ([0-9])+"."([0-9])*
-
- and matches one-or-more digits followed by a '.' followed
- by zero-or-more digits.
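
The expansion of "{name}" into "(definition)" described above can be
illustrated with a small C sketch. This is not flex's own code:
expand_name() and its parameters are hypothetical names chosen for the
illustration, and the sketch ignores quoted strings and character
classes, which flex itself must skip over.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not flex's own code: copies pattern to out,
 * replacing each "{name}" with "(definition)".  Returns 0 on success,
 * -1 if the result does not fit in cap bytes.  Simplification: it does
 * not skip quoted strings or character classes.
 */
int expand_name(const char *pattern, const char *name,
                const char *definition, char *out, size_t cap)
{
    size_t name_len = strlen(name);
    size_t used = 0;

    assert(cap > 0);
    while (*pattern != '\0') {
        int n;

        if (pattern[0] == '{' &&
            strncmp(pattern + 1, name, name_len) == 0 &&
            pattern[1 + name_len] == '}') {
            /* a reference: emit the parenthesized definition */
            n = snprintf(out + used, cap - used, "(%s)", definition);
            pattern += name_len + 2;
        } else {
            /* any other character is copied through unchanged */
            n = snprintf(out + used, cap - used, "%c", *pattern);
            pattern += 1;
        }
        if (n < 0 || (size_t) n >= cap - used)
            return -1;              /* output would be truncated */
        used += (size_t) n;
    }
    out[used] = '\0';
    return 0;
}
```

The parentheses added around the definition are what keep a trailing
operator such as '+' applying to the whole definition rather than to
its last element.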
-
- The rules section of the flex input contains a series of
- rules of the form:
-
- pattern action
-
- where the pattern must be unindented and the action must
- begin on the same line.
-
- See below for a further description of patterns and
- actions.
-
- Finally, the user code section is simply copied to
- lex.yy.c verbatim. It is used for companion routines
- which call or are called by the scanner. The presence of
- this section is optional; if it is missing, the second %%
- in the input file may be skipped, too.
-
- In the definitions and rules sections, any indented text
- or text enclosed in %{ and %} is copied verbatim to the
- output (with the %{}'s removed). The %{}'s must appear
- unindented on lines by themselves.
-
- In the rules section, any indented or %{} text appearing
- before the first rule may be used to declare variables
- which are local to the scanning routine and (after the
- declarations) code which is to be executed whenever the
- scanning routine is entered. Other indented or %{} text
- in the rule section is still copied to the output, but its
- meaning is not well-defined and it may well cause compile-time
- errors (this feature is present for POSIX compliance;
- see below for other such features).
-
- In the definitions section (but not in the rules section),
- an unindented comment (i.e., a line beginning with "/*")
- is also copied verbatim to the output up to the next "*/".
-
- PATTERNS
- The patterns in the input are written using an extended
- set of regular expressions. These are:
-
-     x          match the character 'x'
-     .          any character except newline
-     [xyz]      a "character class"; in this case, the pattern
-                matches either an 'x', a 'y', or a 'z'
-     [abj-oZ]   a "character class" with a range in it; matches
-                an 'a', a 'b', any letter from 'j' through 'o',
-                or a 'Z'
-     [^A-Z]     a "negated character class", i.e., any character
-                but those in the class. In this case, any
-                character EXCEPT an uppercase letter.
-     [^A-Z\n]   any character EXCEPT an uppercase letter or
-                a newline
-     r*         zero or more r's, where r is any regular expression
-     r+         one or more r's
-     r?         zero or one r's (that is, "an optional r")
-     r{2,5}     anywhere from two to five r's
-     r{2,}      two or more r's
-     r{4}       exactly 4 r's
-     {name}     the expansion of the "name" definition
-                (see above)
-     "[xyz]\"foo"
-                the literal string: [xyz]"foo
-     \X         if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
-                then the ANSI-C interpretation of \X.
-                Otherwise, a literal 'X' (used to escape
-                operators such as '*')
-     \123       the character with octal value 123
-     \x2a       the character with hexadecimal value 2a
-     (r)        match an r; parentheses are used to override
-                precedence (see below)
-
-     rs         the regular expression r followed by the
-                regular expression s; called "concatenation"
-
-     r|s        either an r or an s
-
-     r/s        an r but only if it is followed by an s. The
-                s is not part of the matched text. This type
-                of pattern is called "trailing context".
-     ^r         an r, but only at the beginning of a line
-     r$         an r, but only at the end of a line. Equivalent
-                to "r/\n".
-
-     <s>r       an r, but only in start condition s (see
-                below for discussion of start conditions)
-     <s1,s2,s3>r
-                same, but in any of start conditions s1,
-                s2, or s3
-     <*>r       an r in any start condition, even an exclusive one.
-
-     <<EOF>>    an end-of-file
-     <s1,s2><<EOF>>
-                an end-of-file when in start condition s1 or s2
-
- Note that inside of a character class, all regular expres-
- sion operators lose their special meaning except escape
- ('\') and the character class operators, '-', ']', and, at
- the beginning of the class, '^'.
-
- The regular expressions listed above are grouped according
- to precedence, from highest precedence at the top to low-
- est at the bottom. Those grouped together have equal
- precedence. For example,
-
- foo|bar*
-
- is the same as
-
- (foo)|(ba(r*))
-
- since the '*' operator has higher precedence than concate-
- nation, and concatenation higher than alternation ('|').
- This pattern therefore matches either the string "foo" or
- the string "ba" followed by zero-or-more r's. To match
- "foo" or zero-or-more "bar"'s, use:
-
- foo|(bar)*
-
- and to match zero-or-more "foo"'s-or-"bar"'s:
-
- (foo|bar)*
-
-
- Some notes on patterns:
-
- - A negated character class such as the example
- "[^A-Z]" above will match a newline unless "\n" (or an
- equivalent escape sequence) is one of the characters
- explicitly present in the negated character
- class (e.g., "[^A-Z\n]"). This is unlike how many
- other regular expression tools treat negated character
- classes, but unfortunately the inconsistency
- is historically entrenched. Matching newlines
- means that a pattern like [^"]* can match the
- entire input unless there's another quote in the
- input.
-
- - A rule can have at most one instance of trailing
- context (the '/' operator or the '$' operator).
- The start condition, '^', and "<<EOF>>" patterns
- can only occur at the beginning of a pattern and,
- like '/' and '$', cannot be grouped
- inside parentheses. A '^' which does not occur at
- the beginning of a rule or a '$' which does not
- occur at the end of a rule loses its special prop-
- erties and is treated as a normal character.
-
- The following are illegal:
-
- foo/bar$
- <sc1>foo<sc2>bar
-
- Note that the first of these can be written
- "foo/bar\n".
-
- The following will result in '$' or '^' being
- treated as a normal character:
-
- foo|(bar$)
- foo|^bar
-
- If what's wanted is a "foo" or a bar-followed-by-a-
- newline, the following could be used (the special
- '|' action is explained below):
-
- foo |
- bar$ /* action goes here */
-
- A similar trick will work for matching a foo or a
- bar-at-the-beginning-of-a-line.
-
- HOW THE INPUT IS MATCHED
- When the generated scanner is run, it analyzes its input
- looking for strings which match any of its patterns. If
- it finds more than one match, it takes the one matching
- the most text (for trailing context rules, this includes
- the length of the trailing part, even though it will then
- be returned to the input). If it finds two or more
- matches of the same length, the rule listed first in the
- flex input file is chosen.
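
The selection rule just described (longest match wins, ties go to the
rule listed first) can be sketched in C as follows. pick_rule() and the
match_len array are hypothetical names for this illustration only; a
generated scanner does not compute per-rule lengths this way, and the
sketch treats a length of 0 as "no match".

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of flex: match_len[i] is the number of
 * characters rule i matched at the current input position (0 = no
 * match), with rules indexed in the order they appear in the input
 * file.  Returns the index of the winning rule, or -1 if none matched.
 */
int pick_rule(const size_t *match_len, int n_rules)
{
    int best = -1;
    size_t best_len = 0;
    int i;

    assert(match_len != NULL || n_rules == 0);
    for (i = 0; i < n_rules; ++i) {
        /* strictly longer wins; on a tie the earlier rule is kept */
        if (match_len[i] > best_len) {
            best = i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

A return value of -1 corresponds to the case where no rule applies and
the default rule takes over.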
-
- Once the match is determined, the text corresponding to
- the match (called the token) is made available in the
- global character pointer yytext, and its length in the
- global integer yyleng. The action corresponding to the
- matched pattern is then executed (a more detailed description
- of actions follows), and then the remaining input is
- scanned for another match.
-
- If no match is found, then the default rule is executed:
- the next character in the input is considered matched and
- copied to the standard output. Thus, the simplest legal
- flex input is:
-
- %%
-
- which generates a scanner that simply copies its input
- (one character at a time) to its output.
-
- Note that yytext can be defined in two different ways:
- either as a character pointer or as a character array.
- You can control which definition flex uses by including
- one of the special directives %pointer or %array in the
- first (definitions) section of your flex input. The
- default is %pointer, unless you use the -l lex compatibility
- option, in which case yytext will be an array. The
- advantage of using %pointer is substantially faster scanning
- and no buffer overflow when matching very large
- tokens (unless you run out of dynamic memory). The disadvantage
- is that you are restricted in how your actions can
- modify yytext (see the next section), and calls to the
- input() and unput() functions destroy the present contents
- of yytext, which can be a considerable porting headache
- when moving between different lex versions.
-
- The advantage of %array is that you can then modify yytext
- to your heart's content, and calls to input() and unput()
- do not destroy yytext (see below). Furthermore, existing
- lex programs sometimes access yytext externally using
- declarations of the form:
-     extern char yytext[];
- This definition is erroneous when used with %pointer, but
- correct for %array.
-
- %array defines yytext to be an array of YYLMAX characters,
- which defaults to a fairly large value. You can change
- the size by simply #define'ing YYLMAX to a different value
- in the first section of your flex input. As mentioned
- above, with %pointer yytext grows dynamically to accommodate
- large tokens. While this means your %pointer scanner
- can accommodate very large tokens (such as matching entire
- blocks of comments), bear in mind that each time the scanner
- must resize yytext it also must rescan the entire
- token from the beginning, so matching such tokens can
- prove slow. yytext presently does not dynamically grow if
- a call to unput() results in too much text being pushed
- back; instead, a run-time error results.
-
- Also note that you cannot use %array with C++ scanner
- classes (the -+ option; see below).
-
- ACTIONS
- Each pattern in a rule has a corresponding action, which
- can be any arbitrary C statement. The pattern ends at the
- first non-escaped whitespace character; the remainder of
- the line is its action. If the action is empty, then when
- the pattern is matched the input token is simply dis-
- carded. For example, here is the specification for a pro-
- gram which deletes all occurrences of "zap me" from its
- input:
-
- %%
- "zap me"
-
- (It will copy all other characters in the input to the
- output since they will be matched by the default rule.)
-
- Here is a program which compresses multiple blanks and
- tabs down to a single blank, and throws away whitespace
- found at the end of a line:
-
- %%
- [ \t]+ putchar( ' ' );
- [ \t]+$ /* ignore this token */
-
-
- If the action contains a '{', then the action spans till
- the balancing '}' is found, and the action may cross multiple
- lines. flex knows about C strings and comments and
- won't be fooled by braces found within them, but also
- allows actions to begin with %{ and will consider the
- action to be all the text up to the next %} (regardless of
- ordinary braces inside the action).
-
- An action consisting solely of a vertical bar ('|') means
- "same as the action for the next rule." See below for an
- illustration.
-
- Actions can include arbitrary C code, including return
- statements to return a value to whatever routine called
- yylex(). Each time yylex() is called it continues processing
- tokens from where it last left off until it either
- reaches the end of the file or executes a return.
-
- Actions are free to modify yytext except for lengthening
- it (adding characters to its end--these will overwrite
- later characters in the input stream). Modifying the
- final character of yytext may alter whether rules anchored
- with '^' are active when scanning resumes. Specifically,
- changing the final character of yytext to a newline will
- activate such rules on the next scan, and changing it to
- anything else will deactivate the rules. Users should not
- rely on this behavior being present in future releases.
- Finally, note that none of this paragraph applies when
- using %array (see above).
-
- Actions are free to modify yyleng except they should not
- do so if the action also includes use of yymore() (see
- below).
-
- There are a number of special directives which can be
- included within an action:
-
- - ECHO copies yytext to the scanner's output.
-
- - BEGIN followed by the name of a start condition
- places the scanner in the corresponding start condition
- (see below).
-
- - REJECT directs the scanner to proceed on to the
- "second best" rule which matched the input (or a
- prefix of the input). The rule is chosen as
- described above in "How the Input is Matched", and
- yytext and yyleng are set up appropriately. It may
- either be one which matched as much text as the
- originally chosen rule but came later in the flex
- input file, or one which matched less text. For
- example, the following will both count the words in
- the input and call the routine special() whenever
- "frob" is seen:
-
-     int word_count = 0;
- %%
-
- frob special(); REJECT;
- [^ \t\n]+ ++word_count;
-
- Without the REJECT, any "frob"'s in the input would
- not be counted as words, since the scanner normally
- executes only one action per token. Multiple
- REJECT's are allowed, each one finding the next
- best choice to the currently active rule. For
- example, when the following scanner scans the token
- "abcd", it will write "abcdabcaba" to the output:
-
- %%
- a |
- ab |
- abc |
- abcd ECHO; REJECT;
- .|\n /* eat up any unmatched character */
-
- (The first three rules share the fourth's action
- since they use the special '|' action.) REJECT is
- a particularly expensive feature in terms of scanner
- performance; if it is used in any of the scanner's
- actions it will slow down all of the scanner's
- matching. Furthermore, REJECT cannot be used with
- the -Cf or -CF options (see below).
-
- Note also that unlike the other special actions,
- REJECT is a branch; code immediately following it
- in the action will not be executed.
-
- - yymore() tells the scanner that the next time it
- matches a rule, the corresponding token should be
- appended onto the current value of yytext rather
- than replacing it. For example, given the input
- "mega-kludge" the following will write
- "mega-mega-kludge" to the output:
-
- %%
- mega- ECHO; yymore();
- kludge ECHO;
-
- First "mega-" is matched and echoed to the output.
- Then "kludge" is matched, but the previous "mega-"
- is still hanging around at the beginning of yytext
- so the ECHO for the "kludge" rule will actually
- write "mega-kludge". The presence of yymore() in
- the scanner's action entails a minor performance
- penalty in the scanner's matching speed.
-
- - yyless(n) returns all but the first n characters of
- the current token back to the input stream, where
- they will be rescanned when the scanner looks for
- the next match. yytext and yyleng are adjusted
- appropriately (e.g., yyleng will now be equal to n).
- For example, on the input "foobar" the following
- will write out "foobarbar":
-
- %%
- foobar ECHO; yyless(3);
- [a-z]+ ECHO;
-
- An argument of 0 to yyless will cause the entire
- current input string to be scanned again. Unless
- you've changed how the scanner will subsequently
- process its input (using BEGIN, for example), this
- will result in an endless loop.
-
- Note that yyless is a macro and can only be used in the
- flex input file, not from other source files.
-
- - unput(c) puts the character c back onto the input
- stream. It will be the next character scanned.
- The following action will take the current token
- and cause it to be rescanned enclosed in parentheses.
-
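- {
-     int i;
-     /* Copy yytext first: with %pointer, unput() destroys it.
-      * Needs <string.h> for strdup() and <stdlib.h> for free(). */
-     char *yycopy = strdup( yytext );
-     unput( ')' );
-     for ( i = yyleng - 1; i >= 0; --i )
-         unput( yycopy[i] );
-     unput( '(' );
-     free( yycopy );
- }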
-
- Note that since each unput() puts the given character
- back at the beginning of the input stream,
- pushing back strings must be done back-to-front.
- Also note that you cannot put back EOF to attempt
- to mark the input stream with an end-of-file.
-
- - input() reads the next character from the input
- stream. For example, the following is one way to
- eat up C comments:
-
- %%
- "/*" {
-     register int c;
-
-     for ( ; ; )
-         {
-         while ( (c = input()) != '*' &&
-                 c != EOF )
-             ;    /* eat up text of comment */
-
-         if ( c == '*' )
-             {
-             while ( (c = input()) == '*' )
-                 ;
-             if ( c == '/' )
-                 break;    /* found the end */
-             }
-
-         if ( c == EOF )
-             {
-             error( "EOF in comment" );
-             break;
-             }
-         }
- }
-
- (Note that if the scanner is compiled using C++,
- then input() is instead referred to as yyinput(),
- in order to avoid a name clash with the C++ stream
- by the name of input.)
-
- - yyterminate() can be used in lieu of a return
- statement in an action. It terminates the scanner
- and returns a 0 to the scanner's caller, indicating
- "all done". By default, yyterminate() is also
- called when an end-of-file is encountered. It is a
- macro and may be redefined.
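
The back-to-front rule for unput() noted in the list above can be seen
in a small stand-alone C model of a pushback stack. The names
push_back(), next_char(), push_back_string() and PUSHBACK_MAX are
hypothetical stand-ins invented for this sketch; they are not the
scanner's own unput() and input().

```c
#include <assert.h>
#include <string.h>

#define PUSHBACK_MAX 256           /* arbitrary capacity for this sketch */

static char pushback[PUSHBACK_MAX];
static int pushback_len = 0;

/* Stand-in for unput(): the character becomes the next one read. */
int push_back(int c)
{
    if (pushback_len >= PUSHBACK_MAX)
        return -1;                 /* too much text pushed back */
    pushback[pushback_len++] = (char) c;
    return 0;
}

/* Stand-in for input(): returns the next character, or -1 when empty. */
int next_char(void)
{
    return pushback_len > 0 ? (unsigned char) pushback[--pushback_len] : -1;
}

/* Pushes s back so that it is read again in its original order. */
int push_back_string(const char *s)
{
    size_t i;

    assert(s != NULL);
    i = strlen(s);
    while (i > 0)                  /* last character goes in first */
        if (push_back(s[--i]) != 0)
            return -1;
    return 0;
}
```

Pushing ')' first, then the string, then '(' yields the string wrapped
in parentheses when read back, which is the same order of calls the
unput() example above uses.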
-
- THE GENERATED SCANNER
- The output of flex is the file lex.yy.c, which contains
- the scanning routine yylex(), a number of tables used by
- it for matching tokens, and a number of auxiliary routines
- and macros. By default, yylex() is declared as follows:
-
- int yylex()
- {
- ... various definitions and the actions in here ...
- }
-
- (If your environment supports function prototypes, then it
- will be "int yylex( void )".) This definition may be
- changed by defining the "YY_DECL" macro. For example, you
- could use:
-
- #define YY_DECL float lexscan( a, b ) float a, b;
-
- to give the scanning routine the name lexscan, returning a
- float, and taking two floats as arguments. Note that if
- you give arguments to the scanning routine using a
- K&R-style/non-prototyped function declaration, you must
- terminate the definition with a semi-colon (;).
-
- Whenever yylex() is called, it scans tokens from the
- global input file yyin (which defaults to stdin). It continues
- until it either reaches an end-of-file (at which
- point it returns the value 0) or one of its actions executes
- a return statement.
-
- If the scanner reaches an end-of-file, subsequent calls
- are undefined unless either yyin is pointed at a new input
- file (in which case scanning continues from that file), or
- yyrestart() is called. yyrestart() takes one argument, a
- FILE * pointer, and initializes yyin for scanning from
- that file. Essentially there is no difference between
- just assigning yyin to a new input file or using
- yyrestart() to do so; the latter is available for compatibility
- with previous versions of flex, and because it can
- be used to switch input files in the middle of scanning.
- It can also be used to throw away the current input
- buffer, by calling it with an argument of yyin.
-
- If yylex() stops scanning due to executing a return statement
- in one of the actions, the scanner may then be called
- again and it will resume scanning where it left off.
-
- By default (and for purposes of efficiency), the scanner
- uses block-reads rather than simple getc() calls to read
- characters from yyin. The nature of how it gets its input
- can be controlled by defining the YY_INPUT macro.
- YY_INPUT's calling sequence is
- "YY_INPUT(buf,result,max_size)". Its action is to place
- up to max_size characters in the character array buf and
- return in the integer variable result either the number of
- characters read or the constant YY_NULL (0 on Unix systems)
- to indicate EOF. The default YY_INPUT reads from
- the global file-pointer "yyin".
-
- A sample definition of YY_INPUT (in the definitions sec-
- tion of the input file):
-
- %{
- #define YY_INPUT(buf,result,max_size) \
- { \
- int c = getchar(); \
- result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
- }
- %}
-
- This definition will change the input processing to occur
- one character at a time.
-
- You also can add in things like keeping track of the input
- line number this way; but don't expect your scanner to go
- very fast.
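
As a rough sketch of the line-counting idea mentioned above, the
following C function does the work that a one-character YY_INPUT
replacement would do, with a line counter added. The function name and
its parameters are assumptions made for this illustration; it is
written as an ordinary function rather than a macro so that it can be
compiled on its own. Note that the count reflects characters read,
which can run ahead of the token currently being matched.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the body of a YY_INPUT redefinition: reads
 * one character from fp into buf[0] and bumps *line_no after each
 * newline.  Returns the number of characters stored: 1, or 0 at
 * end-of-file.
 */
int read_one_counting_lines(FILE *fp, char *buf, int *line_no)
{
    int c;

    assert(fp != NULL && buf != NULL && line_no != NULL);
    c = getc(fp);
    if (c == EOF)
        return 0;                  /* plays the role of YY_NULL */
    buf[0] = (char) c;
    if (c == '\n')
        ++*line_no;
    return 1;
}
```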
-
- When the scanner receives an end-of-file indication from
- YY_INPUT, it then checks the yywrap() function. If
- yywrap() returns false (zero), then it is assumed that the
- function has gone ahead and set up yyin to point to
- another input file, and scanning continues. If it returns
- true (non-zero), then the scanner terminates, returning 0
- to its caller.
-
- The default yywrap() always returns 1.
-
- The scanner writes its ECHO output to the yyout global
- (default, stdout), which may be redefined by the user simply
- by assigning it to some other FILE pointer.
-
- START CONDITIONS
- flex provides a mechanism for conditionally activating
- rules. Any rule whose pattern is prefixed with "<sc>"
- will only be active when the scanner is in the start condition
- named "sc". For example,
-
- <STRING>[^"]* { /* eat up the string body ... */
- ...
- }
-
- will be active only when the scanner is in the "STRING"
- start condition, and
-
- <INITIAL,STRING,QUOTE>\. { /* handle an escape ... */
- ...
- }
-
- will be active only when the current start condition is
- either "INITIAL", "STRING", or "QUOTE".
-
- Start conditions are declared in the definitions (first)
- section of the input using unindented lines beginning with
- either %s or %x followed by a list of names. The former
- declares inclusive start conditions, the latter exclusive
- start conditions. A start condition is activated using
- the BEGIN action. Until the next BEGIN action is executed,
- rules with the given start condition will be active
- and rules with other start conditions will be inactive.
- If the start condition is inclusive, then rules with no
- start conditions at all will also be active. If it is
- exclusive, then only rules qualified with the start condition
- will be active. A set of rules contingent on the
- same exclusive start condition describes a scanner which is
- independent of any of the other rules in the flex input.
- Because of this, exclusive start conditions make it easy
- to specify "mini-scanners" which scan portions of the
- input that are syntactically different from the rest
- (e.g., comments).
-
- If the distinction between inclusive and exclusive start
- conditions is still a little vague, here's a simple exam-
- ple illustrating the connection between the two. The set
- of rules:
-
- %s example
- %%
- foo /* do something */
-
- is equivalent to
-
- %x example
- %%
- <INITIAL,example>foo /* do something */
-
-
- Also note that the special start-condition specifier <*>
- matches every start condition. Thus, the above example
- could also have been written:
-
- %x example
- %%
- <*>foo /* do something */
-
-
- The default rule (to ECHO any unmatched character) remains
- active in start conditions.
-
- BEGIN(0) returns to the original state where only the
- rules with no start conditions are active. This state can
- also be referred to as the start-condition "INITIAL", so
- BEGIN(INITIAL) is equivalent to BEGIN(0). (The parentheses
- around the start condition name are not required but
- are considered good style.)
-
- BEGIN actions can also be given as indented code at the
- beginning of the rules section. For example, the following
- will cause the scanner to enter the "SPECIAL" start
- condition whenever yylex() is called and the global variable
- enter_special is true:
-
-     int enter_special;
-
- %x SPECIAL
- %%
-     if ( enter_special )
-         BEGIN(SPECIAL);
-
- <SPECIAL>blahblahblah
- ...more rules follow...
-
- To illustrate the uses of start conditions, here is a
- scanner which provides two different interpretations of a
- string like "123.456". By default it will treat it as
- three tokens, the integer "123", a dot ('.'), and the
- integer "456". But if the string is preceded earlier in
- the line by the string "expect-floats" it will treat it as
- a single token, the floating-point number 123.456:
-
- %{
- #include <stdlib.h>
- %}
- %s expect
-
- %%
- expect-floats BEGIN(expect);
-
- <expect>[0-9]+"."[0-9]+      {
-             printf( "found a float, = %f\n",
-                     atof( yytext ) );
-             }
- <expect>\n           {
-             /* that's the end of the line, so
-              * we need another "expect-floats"
-              * before we'll recognize any more
-              * floats
-              */
-             BEGIN(INITIAL);
-             }
-
- [0-9]+      {
-             printf( "found an integer, = %d\n",
-                     atoi( yytext ) );
-             }
-
- "." printf( "found a dot\n" );
-
- Here is a scanner which recognizes (and discards) C com-
- ments while maintaining a count of the current input line.
-
- %x comment
- %%
-     int line_num = 1;
-
- "/*" BEGIN(comment);
-
- <comment>[^*\n]* /* eat anything that's not a '*' */
- <comment>"*"+[^*/\n]* /* eat up '*'s not followed by '/'s */
- <comment>\n ++line_num;
- <comment>"*"+"/" BEGIN(INITIAL);
-
- This scanner goes to a bit of trouble to match as much
- text as possible with each rule. In general, when
- attempting to write a high-speed scanner try to match as
- much as possible in each rule, as it's a big win.
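
To make the role of the exclusive "comment" start condition concrete,
here is a hand-written C sketch of the same idea: one state variable
decides which set of "rules" applies to the next character. It is not
what flex generates, strip_comments() is a hypothetical name, and
unlike the scanner above it copies non-comment text to a buffer and
counts every newline, not only those inside comments.

```c
#include <assert.h>
#include <stddef.h>

enum state { INITIAL, COMMENT };    /* names mirror the start conditions */

/*
 * Hand-written sketch, not flex output: copies in to out with C
 * comments removed and returns the number of newlines seen, inside or
 * outside comments.  out must have room for strlen(in) + 1 bytes.
 */
int strip_comments(const char *in, char *out)
{
    enum state st = INITIAL;
    int newlines = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in == '\n')
            ++newlines;
        if (st == INITIAL && in[0] == '/' && in[1] == '*') {
            st = COMMENT;           /* like BEGIN(comment) */
            in += 2;
        } else if (st == COMMENT && in[0] == '*' && in[1] == '/') {
            st = INITIAL;           /* like BEGIN(INITIAL) */
            in += 2;
        } else {
            if (st == INITIAL)
                *out++ = *in;       /* default rule: copy through */
            ++in;
        }
    }
    *out = '\0';
    return newlines;
}
```

This version advances one character at a time, which is exactly the
habit the paragraph above advises against in a real scanner; the flex
rules consume long runs of comment text in a single match.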
-
- Note that start-condition names are really integer values
- and can be stored as such. Thus, the above could be
- extended in the following fashion:
-
- %x comment foo
- %%
-     int line_num = 1;
-     int comment_caller;
-
- "/*"         {
-              comment_caller = INITIAL;
-              BEGIN(comment);
-              }
-
- ...
-
- <foo>"/*"    {
-              comment_caller = foo;
-              BEGIN(comment);
-              }
-
- <comment>[^*\n]*        /* eat anything that's not a '*' */
- <comment>"*"+[^*/\n]*   /* eat up '*'s not followed by '/'s */
- <comment>\n             ++line_num;
- <comment>"*"+"/"        BEGIN(comment_caller);
-
- Furthermore, you can access the current start condition
- using the integer-valued YY_START macro. For example, the
- above assignments to comment_caller could instead be written
-
- comment_caller = YY_START;
-
- Note that start conditions do not have their own name-
- space; %s's and %x's declare names in the same fashion as
- #define's.
-
- Finally, here's an example of how to match C-style quoted
- strings using exclusive start conditions, including
- expanded escape sequences (but not including checking for
- a string that's too long):
-
- %x str
-
- %%
- char string_buf[MAX_STR_CONST];
- char *string_buf_ptr;
-
-
- \" string_buf_ptr = string_buf; BEGIN(str);
-
- <str>\" { /* saw closing quote - all done */
- BEGIN(INITIAL);
- *string_buf_ptr = '\0';
- /* return string constant token type and
- * value to parser
- */
- }
-
- <str>\n {
- /* error - unterminated string constant */
- /* generate error message */
- }
-
- <str>\\[0-7]{1,3} {
- /* octal escape sequence */
- unsigned int result; /* "%o" expects an unsigned int * */
-
- (void) sscanf( yytext + 1, "%o", &result );
-
- if ( result > 0xff )
- {
- /* error, constant is out-of-bounds */
- }
- else
- *string_buf_ptr++ = result;
- }
-
- <str>\\[0-9]+ {
- /* generate error - bad escape sequence; something
- * like '\48' or '\0777777'
- */
- }
-
- <str>\\n *string_buf_ptr++ = '\n';
- <str>\\t *string_buf_ptr++ = '\t';
- <str>\\r *string_buf_ptr++ = '\r';
- <str>\\b *string_buf_ptr++ = '\b';
- <str>\\f *string_buf_ptr++ = '\f';
-
- <str>\\(.|\n) *string_buf_ptr++ = yytext[1];
-
- <str>[^\\\n\"]+ {
- char *yytext_ptr = yytext;
-
- while ( *yytext_ptr )
- *string_buf_ptr++ = *yytext_ptr++;
- }
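-
- The example above deliberately leaves out the check for a
- string that is too long. As a rough illustration only, the
- same escape handling can be sketched as a plain C helper
- with a bounds check. The name decode_escapes() and its
- cap parameter are hypothetical and not part of flex; in a
- scanner the same test would guard each store through
- string_buf_ptr.
-

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of flex: copy the body of a C-style
 * string constant (without the surrounding quotes) from src to dst,
 * expanding escape sequences.  Returns the decoded length, or -1 if
 * the result plus its terminating NUL does not fit in cap bytes or an
 * octal escape is larger than 0xff.
 */
static int decode_escapes(const char *src, char *dst, size_t cap)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    if (cap == 0)
        return -1;

    while (*src) {
        unsigned int c = (unsigned char) *src++;

        if (c == '\\' && *src) {
            char e = *src++;

            switch (e) {
            case 'n': c = '\n'; break;
            case 't': c = '\t'; break;
            case 'r': c = '\r'; break;
            case 'b': c = '\b'; break;
            case 'f': c = '\f'; break;
            default:
                if (e >= '0' && e <= '7') {
                    /* one to three octal digits, like \\[0-7]{1,3} */
                    int digits = 1;

                    c = (unsigned int) (e - '0');
                    while (digits < 3 && *src >= '0' && *src <= '7') {
                        c = c * 8 + (unsigned int) (*src++ - '0');
                        ++digits;
                    }
                    if (c > 0xff)
                        return -1;
                } else {
                    /* any other escaped character stands for itself */
                    c = (unsigned char) e;
                }
                break;
            }
        }

        if (n + 1 >= cap)   /* keep room for the terminating NUL */
            return -1;
        dst[n++] = (char) c;
    }

    dst[n] = '\0';
    return (int) n;
}
```

-
- Returning an error instead of truncating keeps the decision
- about how to report an over-long constant with the caller,
- which is where the scanner's error rule would live.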
-
-
- MULTIPLE INPUT BUFFERS
- Some scanners (such as those which support "include"
- files) require reading from several input streams. As
- flex scanners do a large amount of buffering, one cannot
- control where the next input will be read from by simply
- writing a YY_INPUT which is sensitive to the scanning
- context. YY_INPUT is only called when the scanner reaches
- the end of its buffer, which may be a long time after
- scanning a statement such as an "include" which requires
- switching the input source.
-
- To negotiate these sorts of problems, flex provides a
- mechanism for creating and switching between multiple
- input buffers. An input buffer is created by using:
-
- YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
-
- which takes a FILE pointer and a size and creates a buffer
- associated with the given file and large enough to hold
- size characters (when in doubt, use YY_BUF_SIZE for the
- size). It returns a YY_BUFFER_STATE handle, which may
- then be passed to other routines:
-
- void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
-
- switches the scanner's input buffer so subsequent tokens
- will come from new_buffer. Note that
- yy_switch_to_buffer() may be used by yywrap() to set
- things up for continued scanning, instead of opening a new
- file and pointing yyin at it.
-
- void yy_delete_buffer( YY_BUFFER_STATE buffer )
-
- is used to reclaim the storage associated with a buffer.
-
- yy_new_buffer() is an alias for yy_create_buffer(),
- provided for compatibility with the C++ use of new and delete
- for creating and destroying dynamic objects.
-
- Finally, the YY_CURRENT_BUFFER macro returns a
- YY_BUFFER_STATE handle to the current buffer.
-
- Here is an example of using these features for writing a
- scanner which expands include files (the <<EOF>> feature
- is discussed below):
-
- /* the "incl" state is used for picking up the name
- * of an include file
- */
- %x incl
-
- %{
- #define MAX_INCLUDE_DEPTH 10
- YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
- int include_stack_ptr = 0;
- %}
-
- %%
- include BEGIN(incl);
-
- [a-z]+ ECHO;
- [^a-z\n]*\n? ECHO;
-
- <incl>[ \t]* /* eat the whitespace */
- <incl>[^ \t\n]+ { /* got the include file name */
- if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
- {
- fprintf( stderr, "Includes nested too deeply" );
- exit( 1 );
- }
-
- include_stack[include_stack_ptr++] =
- YY_CURRENT_BUFFER;
-
- yyin = fopen( yytext, "r" );
-
- if ( ! yyin )
- error( ... );
-
- yy_switch_to_buffer(
- yy_create_buffer( yyin, YY_BUF_SIZE ) );
-
- BEGIN(INITIAL);
- }
-
- <<EOF>> {
- if ( --include_stack_ptr < 0 )
- {
- yyterminate();
- }
-
- else
- {
- yy_delete_buffer( YY_CURRENT_BUFFER );
- yy_switch_to_buffer(
- include_stack[include_stack_ptr] );
- }
- }
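-
- The bookkeeping in the example above is an ordinary bounded
- stack. A minimal sketch of just that part, separated from
- the scanner so it can be compiled alone, might look like
- this. The buffer_handle type and the push_include() and
- pop_include() helpers are hypothetical stand-ins; in a real
- scanner the handles would be YY_BUFFER_STATE values taken
- from YY_CURRENT_BUFFER.
-

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INCLUDE_DEPTH 10

/* Stand-in for YY_BUFFER_STATE (an assumption of this sketch). */
typedef void *buffer_handle;

static buffer_handle include_stack[MAX_INCLUDE_DEPTH];
static int include_stack_ptr = 0;

/* Called when an include file name has been scanned: remember the
 * buffer being suspended.  Returns 0 on success, or -1 if includes
 * are nested too deeply.
 */
static int push_include(buffer_handle current)
{
    assert(include_stack_ptr >= 0);
    if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
        return -1;
    include_stack[include_stack_ptr++] = current;
    return 0;
}

/* Called at end-of-file: returns the buffer to resume, or NULL when
 * the outermost file has ended and the scanner should terminate.
 */
static buffer_handle pop_include(void)
{
    if (include_stack_ptr <= 0)
        return NULL;
    return include_stack[--include_stack_ptr];
}
```

-
- Unlike the <<EOF>> action above, this sketch never lets the
- stack pointer go negative, so the stack remains usable if
- scanning is restarted after the outermost file ends.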
-
-
- END-OF-FILE RULES
- The special rule "<<EOF>>" indicates actions which are to
- be taken when an end-of-file is encountered and yywrap()
- returns non-zero (i.e., indicates no further files to
- process). The action must finish by doing one of four
- things:
-
- - assigning yyin to a new input file (in previous
- versions of flex, after doing the assignment you
- had to call the special action YY_NEW_FILE; this is
- no longer necessary);
-
- - executing a return statement;
-
- - executing the special yyterminate() action;
-
- - or, switching to a new buffer using
- yy_switch_to_buffer() as shown in the example
- above.
-
- <<EOF>> rules may not be used with other patterns; they
- may only be qualified with a list of start conditions. If
- an unqualified <<EOF>> rule is given, it applies to all
- start conditions which do not already have <<EOF>>
- actions. To specify an <<EOF>> rule for only the initial
- start condition, use
-
- <INITIAL><<EOF>>
-
-
- These rules are useful for catching things like unclosed
- comments. An example:
-
- %x quote
- %%
-
- ...other rules for dealing with quotes...
-
- <quote><<EOF>> {
- error( "unterminated quote" );
- yyterminate();
- }
- <<EOF>> {
- if ( *++filelist )
- yyin = fopen( *filelist, "r" );
- else
- yyterminate();
- }
-
-
- MISCELLANEOUS MACROS
- The macro YY_USER_ACTION can be defined to provide an
- action which is always executed prior to the matched
- rule's action. For example, it could be #define'd to call
- a routine to convert yytext to lower-case.
-
- The macro YY_USER_INIT may be defined to provide an action
- which is always executed before the first scan (and before
- the scanner's internal initializations are done). For
- example, it could be used to call a routine to read in a
- data table or open a logging file.
-
- In the generated scanner, the actions are all gathered in
- one large switch statement and separated using YY_BREAK,
- which may be redefined. By default, it is simply a
- "break", to separate each rule's action from the following
- rule's. Redefining YY_BREAK allows, for example, C++
- users to #define YY_BREAK to do nothing (while being very
- careful that every rule ends with a "break" or a
- "return"!) to avoid unreachable-statement warnings: when a
- rule's action ends with "return", the YY_BREAK that
- follows it can never be reached.
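-
- To make the effect concrete, here is a small stand-alone
- sketch of a switch whose cases are separated by YY_BREAK,
- with the macro defined to nothing. The run_action()
- function, the rule numbers and the token names are invented
- for the illustration; the switch that flex actually
- generates is considerably more involved.
-

```c
#include <assert.h>
#include <stddef.h>

/* The default definition would be "break;".  Defined to nothing here,
 * so every action must end in its own "break" or "return".
 */
#define YY_BREAK

enum { TOK_NUMBER = 1, TOK_ID = 2 };

/* Hypothetical stand-in for the generated action switch. */
static int run_action(int rule, int *skipped)
{
    assert(skipped != NULL);

    switch (rule) {
    case 1:
        return TOK_NUMBER;  /* a trailing "break;" would be unreachable */
        YY_BREAK
    case 2:
        return TOK_ID;
        YY_BREAK
    case 3:
        ++*skipped;         /* no return, so an explicit break is needed */
        break;
        YY_BREAK
    }
    return 0;
}
```

-
- If the "break" in the third case were forgotten, control
- would fall through to whatever case follows, which is the
- hazard the warning in the text is about.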
-
- INTERFACING WITH YACC
- One of the main uses of flex is as a companion to the yacc
- parser-generator. yacc parsers expect to call a routine
- named yylex() to find the next input token. The routine
- is supposed to return the type of the next token as well
- as putting any associated value in the global yylval. To
- use flex with yacc, one specifies the -d option to yacc to
- instruct it to generate the file y.tab.h containing
- definitions of all the %tokens appearing in the yacc input.
- This file is then included in the flex scanner. For
- example, if one of the tokens is "TOK_NUMBER", part of the
- scanner might look like:
-
- %{
- #include "y.tab.h"
- %}
-
- %%
-
- [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
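-
- The contract between the two tools is easiest to see in a
- hand-written yylex(). The sketch below is not what flex
- generates; it only shows the return-value and yylval
- conventions a yacc parser relies on. The token value 257
- and the input cursor are assumptions made for the sake of a
- self-contained example; a real scanner takes TOK_NUMBER from
- y.tab.h and reads from yyin.
-

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Normally supplied by y.tab.h; the value is assumed for this sketch. */
#define TOK_NUMBER 257

int yylval;                     /* value of the token just returned */
static const char *input = "";  /* hypothetical input cursor */

/* Returns the type of the next token, or 0 at end of input.  For
 * TOK_NUMBER the number's value is left in yylval.
 */
int yylex(void)
{
    assert(input != NULL);

    while (*input == ' ')
        ++input;

    if (*input == '\0')
        return 0;

    if (isdigit((unsigned char) *input)) {
        char *end;

        yylval = (int) strtol(input, &end, 10);
        input = end;
        return TOK_NUMBER;
    }

    /* any other character is returned as a single-character token */
    return (unsigned char) *input++;
}
```

-
- Scanning "12 + 7" with this routine yields TOK_NUMBER with
- yylval set to 12, then '+', then TOK_NUMBER with yylval set
- to 7, and finally 0.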
-
-
- OPTIONS
- flex has the following options:
-
- --bb Generate backing-up information to _l_e_x_._b_a_c_k_u_p_.
- This is a list of scanner states which require
- backing up and the input characters on which they
- do so. By adding rules one can remove backing-up
- states. If all backing-up states are eliminated
- and --CCff or --CCFF is used, the generated scanner will
- run faster (see the --pp flag). Only users who wish
- to squeeze every last cycle out of their scanners
- need worry about this option. (See the section on
- Performance Considerations below.)
-
- --cc is a do-nothing, deprecated option included for
- POSIX compliance.
-
- NOTE: in previous releases of flex -c specified
- table-compression options. This functionality is
- now given by the -C flag. To ease the impact
- of this change, when flex encounters -c, it
- currently issues a warning message and assumes that -C
- was desired instead. In the future this "promotion"
- of -c to -C will go away in the name of full
- POSIX compliance (unless the POSIX meaning is
- removed first).
-
- --dd makes the generated scanner run in _d_e_b_u_g mode.
- Whenever a pattern is recognized and the global
- yyyy__fflleexx__ddeebbuugg is non-zero (which is the default),
- the scanner will write to _s_t_d_e_r_r a line of the
- form:
-
- --accepting rule at line 53 ("the matched text")
-
- The line number refers to the location of the rule
- in the file defining the scanner (i.e., the file
- that was fed to flex). Messages are also generated
- when the scanner backs up, accepts the default
- rule, reaches the end of its input buffer (or
- encounters a NUL; at this point, the two look the
- same as far as the scanner's concerned), or reaches
- an end-of-file.
-
- --ff specifies _f_a_s_t _s_c_a_n_n_e_r_. No table compression is
- done and stdio is bypassed. The result is large
- but fast. This option is equivalent to --CCffrr (see
- below).
-
- --hh generates a "help" summary of _f_l_e_x_'_s options to
- _s_t_d_e_r_r and then exits.
-
- --ii instructs _f_l_e_x to generate a _c_a_s_e_-_i_n_s_e_n_s_i_t_i_v_e scan-
- ner. The case of letters given in the _f_l_e_x input
- patterns will be ignored, and tokens in the input
- will be matched regardless of case. The matched
- text given in _y_y_t_e_x_t will have the preserved case
- (i.e., it will not be folded).
-
- --ll turns on maximum compatibility with the original
- AT&T _l_e_x implementation. Note that this does not
- mean _f_u_l_l compatibility. Use of this option costs
- a considerable amount of performance, and it cannot
- be used with the --++,, --ff,, --FF,, --CCff,, or --CCFF options.
- For details on the compatibilities it provides, see
- the section "Incompatibilities With Lex And POSIX"
- below.
-
- --nn is another do-nothing, deprecated option included
- only for POSIX compliance.
-
- --pp generates a performance report to stderr. The
- report consists of comments regarding features of
- the _f_l_e_x input file which will cause a serious loss
- of performance in the resulting scanner. If you
- give the flag twice, you will also get comments
- regarding features that lead to minor performance
- losses.
-
- Note that the use of RREEJJEECCTT and variable trailing
- context (see the Bugs section in flex(1)) entails a
- substantial performance penalty; use of _y_y_m_o_r_e_(_)_,
- the ^^ operator, and the --II flag entail minor per-
- formance penalties.
-
- --ss causes the _d_e_f_a_u_l_t _r_u_l_e (that unmatched scanner
- input is echoed to _s_t_d_o_u_t_) to be suppressed. If
- the scanner encounters input that does not match
- any of its rules, it aborts with an error. This
- option is useful for finding holes in a scanner's
- rule set.
-
- --tt instructs _f_l_e_x to write the scanner it generates to
- standard output instead of lleexx..yyyy..cc..
-
- --vv specifies that _f_l_e_x should write to _s_t_d_e_r_r a sum-
- mary of statistics regarding the scanner it gener-
- ates. Most of the statistics are meaningless to
- the casual _f_l_e_x user, but the first line identifies
- the version of _f_l_e_x (same as reported by --VV)),, and
- the next line the flags used when generating the
- scanner, including those that are on by default.
-
- --ww suppresses warning messages.
-
- --BB instructs _f_l_e_x to generate a _b_a_t_c_h scanner, the
- opposite of _i_n_t_e_r_a_c_t_i_v_e scanners generated by --II
- (see below). In general, you use --BB when you are
- _c_e_r_t_a_i_n that your scanner will never be used inter-
- actively, and you want to squeeze a _l_i_t_t_l_e more
- performance out of it. If your goal is instead to
- squeeze out a _l_o_t more performance, you should be
- using the --CCff or --CCFF options (discussed below),
- which turn on --BB automatically anyway.
-
- --FF specifies that the _f_a_s_t scanner table representa-
- tion should be used (and stdio bypassed). This
- representation is about as fast as the full table
- representation ((--ff)),, and for some sets of patterns
- will be considerably smaller (and for others,
- larger). In general, if the pattern set contains
- both "keywords" and a catch-all, "identifier" rule,
- such as in the set:
-
- "case" return TOK_CASE;
- "switch" return TOK_SWITCH;
- ...
- "default" return TOK_DEFAULT;
- [a-z]+ return TOK_ID;
-
- then you're better off using the full table repre-
- sentation. If only the "identifier" rule is pre-
- sent and you then use a hash table or some such to
- detect the keywords, you're better off using --FF..
-
- This option is equivalent to --CCFFrr (see below). It
- cannot be used with --++..
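-
- The second arrangement, a single "identifier" rule followed
- by a table look-up, could be sketched as follows. The text
- suggests a hash table "or some such"; this sketch swaps in a
- sorted array searched with bsearch(), which is simpler to
- show. The token values and the classify_identifier() name
- are hypothetical; the function would be called from the
- identifier rule's action with yytext as its argument.
-

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Token values are made up for the example. */
enum { TOK_ID = 1, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name, since bsearch() requires it. */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE    },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH  },
};

static int compare_keyword(const void *key, const void *elem)
{
    return strcmp((const char *) key,
                  ((const struct keyword *) elem)->name);
}

/* Returns the keyword's token, or TOK_ID if text is not a keyword. */
static int classify_identifier(const char *text)
{
    const struct keyword *kw;

    assert(text != NULL);
    kw = bsearch(text, keywords,
                 sizeof keywords / sizeof keywords[0],
                 sizeof keywords[0], compare_keyword);
    return kw ? kw->token : TOK_ID;
}
```

-
- With this in place the scanner needs only the [a-z]+ rule,
- whose action becomes "return classify_identifier( yytext );".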
-
- --II instructs _f_l_e_x to generate an _i_n_t_e_r_a_c_t_i_v_e scanner.
- An interactive scanner is one that only looks ahead
- to decide what token has been matched if it
- absolutely must. It turns out that always looking
- one extra character ahead, even if the scanner has
- already seen enough text to disambiguate the cur-
- rent token, is a bit faster than only looking ahead
- when necessary. But scanners that always look
- ahead give dreadful interactive performance; for
- example, when a user types a newline, it is not
- recognized as a newline token until they enter
- _a_n_o_t_h_e_r token, which often means typing in another
- whole line.
-
- _F_l_e_x scanners default to _i_n_t_e_r_a_c_t_i_v_e unless you use
- the --CCff or --CCFF table-compression options (see
- below). That's because if you're looking for high-
- performance you should be using one of these
- options, so if you didn't, _f_l_e_x assumes you'd
- rather trade off a bit of run-time performance for
- intuitive interactive behavior. Note also that you
- _c_a_n_n_o_t use --II in conjunction with --CCff or --CCFF..
- Thus, this option is not really needed; it is on by
- default for all those cases in which it is allowed.
-
- You can force a scanner to _n_o_t be interactive by
- using --BB (see above).
-
- --LL instructs _f_l_e_x not to generate ##lliinnee directives.
- Without this option, _f_l_e_x peppers the generated
- scanner with #line directives so error messages in
- the actions will be correctly located with respect
- to the original _f_l_e_x input file, and not to the
- fairly meaningless line numbers of lleexx..yyyy..cc..
- (Unfortunately _f_l_e_x does not presently generate the
- necessary directives to "retarget" the line numbers
- for those parts of lleexx..yyyy..cc which it generated. So
- if there is an error in the generated code, a mean-
- ingless line number is reported.)
-
- --TT makes _f_l_e_x run in _t_r_a_c_e mode. It will generate a
- lot of messages to _s_t_d_e_r_r concerning the form of
- the input and the resultant non-deterministic and
- deterministic finite automata. This option is
- mostly for use in maintaining _f_l_e_x_.
-
- --VV prints the version number to _s_t_d_e_r_r and exits.
-
- -7 instructs flex to generate a 7-bit scanner, i.e.,
- one which can only recognize 7-bit characters in
- its input. The advantage of using -7 is that the
- scanner's tables can be up to half the size of
- those generated using the -8 option (see below).
- The disadvantage is that such scanners often hang
- or crash if their input contains an 8-bit
- character.
-
- Note, however, that unless you generate your scan-
- ner using the --CCff or --CCFF table compression options,
- use of --77 will save only a small amount of table
- space, and make your scanner considerably less
- portable. _F_l_e_x_'_s default behavior is to generate
- an 8-bit scanner unless you use the --CCff or --CCFF,, in
- which case _f_l_e_x defaults to generating 7-bit scan-
- ners unless your site was always configured to gen-
- erate 8-bit scanners (as will often be the case
- with non-USA sites). You can tell whether flex
- generated a 7-bit or an 8-bit scanner by inspecting
- the flag summary in the --vv output as described
- above.
-
- Note that if you use -Cfe or -CFe (those table
- compression options, but also using equivalence
- classes as discussed below), flex still
- defaults to generating an 8-bit scanner, since
- usually with these compression options full 8-bit
- tables are not much more expensive than 7-bit
- tables.
-
- --88 instructs _f_l_e_x to generate an 8-bit scanner, i.e.,
- one which can recognize 8-bit characters. This
- flag is only needed for scanners generated using
- --CCff or --CCFF,, as otherwise flex defaults to generat-
- ing an 8-bit scanner anyway.
-
- See the discussion of --77 above for flex's default
- behavior and the tradeoffs between 7-bit and 8-bit
- scanners.
-
- --++ specifies that you want flex to generate a C++
- scanner class. See the section on Generating C++
- Scanners below for details.
-
- --CC[[aaeeffFFmmrr]]
- controls the degree of table compression and, more
- generally, trade-offs between small scanners and
- fast scanners.
-
- --CCaa ("align") instructs flex to trade off larger
- tables in the generated scanner for faster perfor-
- mance because the elements of the tables are better
- aligned for memory access and computation. On some
- RISC architectures, fetching and manipulating long-
- words is more efficient than with smaller-sized
- datums such as shortwords. This option can double
- the size of the tables used by your scanner.
-
- --CCee directs _f_l_e_x to construct _e_q_u_i_v_a_l_e_n_c_e _c_l_a_s_s_e_s_,
- i.e., sets of characters which have identical lexi-
- cal properties (for example, if the only appearance
- of digits in the _f_l_e_x input is in the character
- class "[0-9]" then the digits '0', '1', ..., '9'
- will all be put in the same equivalence class).
- Equivalence classes usually give dramatic reduc-
- tions in the final table/object file sizes (typi-
- cally a factor of 2-5) and are pretty cheap perfor-
- mance-wise (one array look-up per character
- scanned).
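-
- As a rough illustration of the idea (and not flex's actual
- algorithm), equivalence classes can be computed by giving
- each character a signature that records which of the
- patterns' character sets it belongs to; characters with
- equal signatures share a class. The build_classes()
- function and the 7-bit table size are assumptions of this
- sketch.
-

```c
#include <assert.h>
#include <string.h>

#define NCHARS 128

/* ec[c] is the equivalence class of character c; classes start at 1. */
static unsigned char ec[NCHARS];

/* Partition the characters so that two characters share a class
 * exactly when they belong to the same subset of the given sets.
 * Each sets[i] is a NUL-terminated list of member characters.
 * Returns the number of classes.
 */
static int build_classes(const char *const sets[], int nsets)
{
    unsigned long sig[NCHARS];  /* bit i set: character is in sets[i] */
    int nclasses = 0;
    int c, d, i;

    assert(nsets >= 0 && nsets <= 32);

    for (c = 0; c < NCHARS; ++c) {
        sig[c] = 0;
        /* skip NUL: strchr() would match the string terminator */
        for (i = 0; c != 0 && i < nsets; ++i)
            if (strchr(sets[i], c) != NULL)
                sig[c] |= 1UL << i;
    }

    for (c = 0; c < NCHARS; ++c) {
        /* reuse the class of an earlier character with this signature */
        for (d = 0; d < c; ++d)
            if (sig[d] == sig[c])
                break;
        ec[c] = (d < c) ? ec[d] : (unsigned char) ++nclasses;
    }

    return nclasses;
}
```

-
- A scanner built this way indexes its transition table by
- ec[c] instead of by c, so each row needs one column per
- class rather than one per character. That extra indexing
- step is the "one array look-up per character" mentioned
- above.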
-
- -Cf specifies that the full scanner tables should
- be generated - flex should not compress the tables
- by taking advantage of similar transition
- functions for different states.
-
- --CCFF specifies that the alternate fast scanner rep-
- resentation (described above under the --FF flag)
- should be used. This option cannot be used with
- --++..
-
- --CCmm directs _f_l_e_x to construct _m_e_t_a_-_e_q_u_i_v_a_l_e_n_c_e
- _c_l_a_s_s_e_s_, which are sets of equivalence classes (or
- characters, if equivalence classes are not being
- used) that are commonly used together. Meta-
- equivalence classes are often a big win when using
- compressed tables, but they have a moderate perfor-
- mance impact (one or two "if" tests and one array
- look-up per character scanned).
-
- --CCrr causes the generated scanner to _b_y_p_a_s_s use of
- the standard I/O library (stdio) for input.
- Instead of calling ffrreeaadd(()) or ggeettcc(()),, the scanner
- will use the rreeaadd(()) system call, resulting in a
- performance gain which varies from system to sys-
- tem, but in general is probably negligible unless
- you are also using --CCff or --CCFF.. Using --CCrr can cause
- strange behavior if, for example, you read from
- _y_y_i_n using stdio prior to calling the scanner
- (because the scanner will miss whatever text your
- previous reads left in the stdio input buffer).
-
- --CCrr has no effect if you define YYYY__IINNPPUUTT (see The
- Generated Scanner above).
-
- A lone --CC specifies that the scanner tables should
- be compressed but neither equivalence classes nor
- meta-equivalence classes should be used.
-
- The options --CCff or --CCFF and --CCmm do not make sense
- together - there is no opportunity for meta-
- equivalence classes if the table is not being com-
- pressed. Otherwise the options may be freely
- mixed, and are cumulative.
-
- The default setting is --CCeemm,, which specifies that
- _f_l_e_x should generate equivalence classes and meta-
- equivalence classes. This setting provides the
- highest degree of table compression. You can trade
- off faster-executing scanners at the cost of larger
- tables with the following generally being true:
-
- slowest & smallest
- -Cem
- -Cm
- -Ce
- -C
- -C{f,F}e
- -C{f,F}
- -C{f,F}a
- fastest & largest
-
- Note that scanners with the smallest tables are
- usually generated and compiled the quickest, so
- during development you will usually want to use the
- default, maximal compression.
-
- --CCffee is often a good compromise between speed and
- size for production scanners.
-
- --PPpprreeffiixx
- changes the default _y_y prefix used by _f_l_e_x for all
- globally-visible variable and function names to
- instead be _p_r_e_f_i_x_. For example, --PPffoooo changes the
- name of yyyytteexxtt to ffooootteexxtt.. It also changes the
- name of the default output file from lleexx..yyyy..cc to
- lleexx..ffoooo..cc.. Here are all of the names affected:
-
- yyFlexLexer
- yy_create_buffer
- yy_delete_buffer
- yy_flex_debug
- yy_init_buffer
- yy_load_buffer_state
- yy_switch_to_buffer
- yyin
- yyleng
- yylex
- yyout
- yyrestart
- yytext
- yywrap
-
- Within your scanner itself, you can still refer to
- the global variables and functions using either
- version of their name; but externally, they have the
- modified name.
-
- This option lets you easily link together multiple
- _f_l_e_x programs into the same executable. Note,
- though, that using this option also renames
- yyyywwrraapp(()),, so you now _m_u_s_t provide your own (appro-
- priately-named) version of the routine for your
- scanner, as linking with --llffll no longer provides
- one for you by default.
-
- --SSsskkeelleettoonn__ffiillee
- overrides the default skeleton file from which _f_l_e_x
- constructs its scanners. You'll never need this
- option unless you are doing _f_l_e_x maintenance or
- development.
-
- PERFORMANCE CONSIDERATIONS
- The main design goal of flex is that it generate
- high-performance scanners. It has been optimized for dealing
- well with large sets of rules. Aside from the effects on
- scanner speed of the table compression -C options outlined
- above, there are a number of options/actions which degrade
- performance. These are, from most expensive to least:
-
- REJECT
-
- pattern sets that require backing up
- arbitrary trailing context
-
- yymore()
- '^' beginning-of-line operator
-
- with the first three all being quite expensive and the
- last two being quite cheap. Note also that unput() is
- implemented as a routine call that potentially does quite
- a bit of work, while yyless() is a quite-cheap macro; so
- if just putting back some excess text you scanned, use
- yyless().
-
- REJECT should be avoided at all costs when performance is
- important. It is a particularly expensive option.
-
- Getting rid of backing up is messy and often may be an
- enormous amount of work for a complicated scanner. In
- principle, one begins by using the -b flag to generate a
- lex.backup file. For example, on the input
-
- %%
- foo return TOK_KEYWORD;
- foobar return TOK_KEYWORD;
-
- the file looks like:
-
- State #6 is non-accepting -
- associated rule line numbers:
- 2 3
- out-transitions: [ o ]
- jam-transitions: EOF [ \001-n p-\177 ]
-
- State #8 is non-accepting -
- associated rule line numbers:
- 3
- out-transitions: [ a ]
- jam-transitions: EOF [ \001-` b-\177 ]
-
- State #9 is non-accepting -
- associated rule line numbers:
- 3
- out-transitions: [ r ]
- jam-transitions: EOF [ \001-q s-\177 ]
-
- Compressed tables always back up.
-
- The first few lines tell us that there's a scanner state
- in which it can make a transition on an 'o' but not on any
- other character, and that in that state the currently
- scanned text does not match any rule. The state occurs
- when trying to match the rules found at lines 2 and 3 in
- the input file. If the scanner is in that state and then
- reads something other than an 'o', it will have to back up
- to find a rule which is matched. With a bit of head-
- scratching one can see that this must be the state it's in
- when it has seen "fo". When this has happened, if any-
- thing other than another 'o' is seen, the scanner will
- have to back up to simply match the 'f' (by the default
- rule).
-
- The comment regarding State #8 indicates there's a problem
- when "foob" has been scanned. Indeed, on any character
- other than an 'a', the scanner will have to back up to
- accept "foo". Similarly, the comment for State #9 con-
- cerns when "fooba" has been scanned and an 'r' does not
- follow.
-
- The final comment reminds us that there's no point going
- to all the trouble of removing backing up from the rules
- unless we're using --CCff or --CCFF,, since there's no perfor-
- mance gain doing so with compressed scanners.
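-
- What "backing up" costs can be seen in a hand-built matcher
- for the same two rules. The sketch below is only an
- illustration of the idea, not the code flex generates: it
- remembers the length at the last accepting state and
- reports how many characters scanned beyond it had to be
- given back. The match_keyword() name and its interface are
- invented for the example.
-

```c
#include <assert.h>
#include <stddef.h>

/* Hand-built matcher for the two rules "foo" and "foobar".  Returns
 * the length of the longest match at the start of s (0 if neither
 * rule matches) and stores in *backed_up the number of characters
 * that were scanned past that match and had to be given back.
 */
static size_t match_keyword(const char *s, size_t *backed_up)
{
    static const char longest[] = "foobar";
    size_t scanned = 0;      /* characters consumed so far */
    size_t last_accept = 0;  /* length at the last accepting state */

    assert(s != NULL && backed_up != NULL);

    while (longest[scanned] != '\0' && s[scanned] == longest[scanned]) {
        ++scanned;
        if (scanned == 3 || scanned == 6)   /* "foo" or "foobar" */
            last_accept = scanned;
    }

    *backed_up = scanned - last_accept;
    return last_accept;
}
```

-
- On the input "fooba!" the matcher scans five characters,
- finds that no 'r' follows, and backs up two characters to
- accept "foo"; this is the situation described for State #9.
- On "fox" it backs up over "fo" and matches neither rule,
- which in a flex scanner leaves the 'f' to the default rule.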
-
- The way to remove the backing up is to add "error" rules:
-
- %%
- foo return TOK_KEYWORD;
- foobar return TOK_KEYWORD;
-
- fooba |
- foob |
- fo {
- /* false alarm, not really a keyword */
- return TOK_ID;
- }
-
- Eliminating backing up among a list of keywords can also
- be done using a "catch-all" rule:
-
- %%
- foo return TOK_KEYWORD;
- foobar return TOK_KEYWORD;
-
- [a-z]+ return TOK_ID;
-
- This is usually the best solution when appropriate.
-
- Backing up messages tend to cascade. With a complicated
- set of rules it's not uncommon to get hundreds of mes-
- sages. If one can decipher them, though, it often only
- takes a dozen or so rules to eliminate the backing up
- (though it's easy to make a mistake and have an error rule
- accidentally match a valid token. A possible future _f_l_e_x
- feature will be to automatically add rules to eliminate
- backing up).
-
- Variable trailing context (where both the leading and
- trailing parts do not have a fixed length) entails almost
- the same performance loss as REJECT (i.e., substantial).
- So when possible a rule like:
-
- %%
- mouse|rat/(cat|dog) run();
-
- is better written:
-
- %%
- mouse/cat|dog run();
- rat/cat|dog run();
-
- or as
-
- %%
- mouse|rat/cat run();
- mouse|rat/dog run();
-
- Note that here the special '|' action does _n_o_t provide any
- savings, and can even make things worse (see
-
- A final note regarding performance: as mentioned above in
- the section How the Input is Matched, dynamically resizing
- yytext to accommodate huge tokens is a slow process because
- it presently requires that the (huge) token be rescanned
- from the beginning. Thus if performance is vital, you
- should attempt to match "large" quantities of text but not
- "huge" quantities, where the cutoff between the two is at
- about 8K characters/token.
-
- Another area where the user can increase a scanner's per-
- formance (and one that's easier to implement) arises from
- the fact that the longer the tokens matched, the faster
- the scanner will run. This is because with long tokens
- the processing of most input characters takes place in the
- (short) inner scanning loop, and does not often have to go
- through the additional work of setting up the scanning
- environment (e.g., yyyytteexxtt)) for the action. Recall the
- scanner for C comments:
-
- %x comment
- %%
- int line_num = 1;
-
- "/*" BEGIN(comment);
-
- <comment>[^*\n]*
- <comment>"*"+[^*/\n]*
- <comment>\n ++line_num;
- <comment>"*"+"/" BEGIN(INITIAL);
-
- This could be sped up by writing it as:
-
- %x comment
- %%
- int line_num = 1;
-
- "/*" BEGIN(comment);
-
- <comment>[^*\n]*
- <comment>[^*\n]*\n ++line_num;
- <comment>"*"+[^*/\n]*
- <comment>"*"+[^*/\n]*\n ++line_num;
- <comment>"*"+"/" BEGIN(INITIAL);
-
- Now instead of each newline requiring the processing of
- another action, recognizing the newlines is "distributed"
- over the other rules to keep the matched text as long as
- possible. Note that _a_d_d_i_n_g rules does _n_o_t slow down the
- scanner! The speed of the scanner is independent of the
- number of rules or (modulo the considerations given at the
- beginning of this section) how complicated the rules are
- with regard to operators such as '*' and '|'.
-
- A final example in speeding up a scanner: suppose you want
- to scan through a file containing identifiers and key-
- words, one per line and with no other extraneous charac-
- ters, and recognize all the keywords. A natural first
- approach is:
-
- %%
- asm |
- auto |
- break |
- ... etc ...
- volatile |
- while /* it's a keyword */
-
- .|\n /* it's not a keyword */
-
- To eliminate the back-tracking, introduce a catch-all
- rule:
-
- %%
- asm |
- auto |
- break |
- ... etc ...
- volatile |
- while /* it's a keyword */
-
- [a-z]+ |
- .|\n /* it's not a keyword */
-
- Now, if it's guaranteed that there's exactly one word per
- line, then we can reduce the total number of matches by a
- half by merging in the recognition of newlines with that
- of the other tokens:
-
- %%
- asm\n |
- auto\n |
- break\n |
- ... etc ...
- volatile\n |
- while\n /* it's a keyword */
-
- [a-z]+\n |
- .|\n /* it's not a keyword */
-
- One has to be careful here, as we have now reintroduced
- backing up into the scanner. In particular, while we know
- that there will never be any characters in the input
- stream other than letters or newlines, flex can't figure
- this out, and it will plan for possibly needing to back up
- when it has scanned a token like "auto" and then the next
- character is something other than a newline or a letter.
- Previously it would then just match the "auto" rule and be
- done, but now it has no "auto" rule, only an "auto\n" rule.
- To eliminate the possibility of backing up, we could
- either duplicate all rules but without final newlines, or,
- since we never expect to encounter such an input and
- therefore don't care how it's classified, we can introduce
- one more catch-all rule, this one which doesn't include a
- newline:
-
- %%
- asm\n |
- auto\n |
- break\n |
- ... etc ...
- volatile\n |
- while\n /* it's a keyword */
-
- [a-z]+\n |
- [a-z]+ |
- .|\n /* it's not a keyword */
-
- Compiled with -Cf, this is about as fast as one can get a
- flex scanner to go for this particular problem.
-
- A final note: flex is slow when matching NUL's,
- particularly when a token contains multiple NUL's. It's best to
- write rules which match short amounts of text if it's
- anticipated that the text will often include NUL's.
-
- GENERATING C++ SCANNERS
- flex provides two different ways to generate scanners for
- use with C++. The first way is to simply compile a scanner
- generated by flex using a C++ compiler instead of a C
- compiler. You should not encounter any compilation
- errors (please report any you find to the email address
- given in the Author section below). You can then use C++
- code in your rule actions instead of C code. Note that
- the default input source for your scanner remains yyin,
- and default echoing is still done to yyout. Both of these
- remain FILE * variables and not C++ streams.
-
- You can also use flex to generate a C++ scanner class,
- using the -+ option, which is automatically specified if
- the name of the flex executable ends in a '+', such as
- flex++. When using this option, flex defaults to generating
- the scanner to the file lex.yy.cc instead of lex.yy.c.
- The generated scanner includes the header file
- FlexLexer.h, which defines the interface to two C++
- classes.
-
- The first class, FlexLexer, provides an abstract base
- class defining the general scanner class interface. It
- provides the following member functions:
-
- const char* YYText()
- returns the text of the most recently matched
- token, the equivalent of yytext.
-
- int YYLeng()
- returns the length of the most recently matched
- token, the equivalent of yyleng.
-
- Also provided are member functions equivalent to
- yy_switch_to_buffer(), yy_create_buffer() (though the
- first argument is an istream* object pointer and not a
- FILE*), yy_delete_buffer(), and yyrestart() (again, the
- first argument is an istream* object pointer).
-
- The second class defined in FlexLexer.h is yyFlexLexer,
- which is derived from FlexLexer. It defines the following
- additional member functions:
-
- yyFlexLexer( istream* arg_yyin = 0, ostream* arg_yyout = 0 )
- constructs a yyFlexLexer object using the given
- streams for input and output. If not specified,
- the streams default to cin and cout, respectively.
-
- virtual int yylex()
- performs the same role as yylex() does for ordinary
- flex scanners: it scans the input stream, consuming
- tokens, until a rule's action returns a value.
-
- In addition, yyFlexLexer defines the following protected
- virtual functions which you can redefine in derived
- classes to tailor the scanner:
-
- virtual int LexerInput( char* buf, int max_size )
- reads up to max_size characters into buf and
- returns the number of characters read. To indicate
- end-of-input, return 0 characters. Note that
- "interactive" scanners (see the -B and -I flags)
- define the macro YY_INTERACTIVE. If you redefine
- LexerInput() and need to take different actions
- depending on whether or not the scanner might be
- scanning an interactive input source, you can test
- for the presence of this name via #ifdef.
-
- virtual void LexerOutput( const char* buf, int size )
- writes out size characters from the buffer buf,
- which, while NUL-terminated, may also contain
- "internal" NUL's if the scanner's rules can match
- text with NUL's in them.
-
- virtual void LexerError( const char* msg )
- reports a fatal error message. The default version
- of this function writes the message to the stream
- cerr and exits.
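-
- The LexerInput() contract described above (fill buf with at
- most max_size characters, return the count, return 0 at
- end-of-input) is the same whatever the input source is. The
- following is a minimal sketch in plain C, assuming the text
- to scan is already held in memory. The names struct
- mem_source and mem_source_read() are hypothetical, used only
- for illustration, and are not part of flex.
-
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source; not part of flex. */
struct mem_source {
    const char *data;   /* text still to be handed to the scanner */
    size_t      left;   /* number of characters not yet read */
};

/*
 * Copy up to max_size characters into buf and return how many were
 * copied.  A return value of 0 signals end-of-input, following the
 * convention described for LexerInput().
 */
int mem_source_read(struct mem_source *src, char *buf, int max_size)
{
    size_t n;

    if (max_size <= 0 || src->left == 0)
        return 0;

    n = src->left < (size_t) max_size ? src->left : (size_t) max_size;
    memcpy(buf, src->data, n);   /* may include embedded NUL's */
    src->data += n;
    src->left -= n;
    return (int) n;
}
```
-
- A redefined LexerInput() could delegate to a routine of this
- shape. Because the length is tracked explicitly rather than
- found with strlen(), text containing NUL's is passed through
- unchanged.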
-
- Note that a yyFlexLexer object contains its entire scanning
- state. Thus you can use such objects to create reentrant
- scanners. You can instantiate multiple instances of
- the same yyFlexLexer class, and you can also combine
- multiple C++ scanner classes together in the same program
- using the -P option discussed above.
-
- Finally, note that the %array feature is not available to
- C++ scanner classes; you must use %pointer (the default).
-
- Here is an example of a simple C++ scanner:
-
- // An example of using the flex C++ scanner class.
-
- %{
- int mylineno = 0;
- %}
-
- string \"[^\n"]+\"
-
- ws [ \t]+
-
- alpha [A-Za-z]
- dig [0-9]
- name ({alpha}|{dig}|\$)({alpha}|{dig}|[_.\-/$])*
- num1 [-+]?{dig}+\.?([eE][-+]?{dig}+)?
- num2 [-+]?{dig}*\.{dig}+([eE][-+]?{dig}+)?
- number {num1}|{num2}
-
- %%
-
- {ws} /* skip blanks and tabs */
-
- "/*" {
- int c;
-
- while((c = yyinput()) != 0)
- {
- if(c == '\n')
- ++mylineno;
-
- else if(c == '*')
- {
- if((c = yyinput()) == '/')
- break;
- else
- unput(c);
- }
- }
- }
-
- {number} cout << "number " << YYText() << '\n';
-
- \n mylineno++;
-
- {name} cout << "name " << YYText() << '\n';
-
- {string} cout << "string " << YYText() << '\n';
-
- %%
-
- int main( int /* argc */, char** /* argv */ )
- {
- FlexLexer* lexer = new yyFlexLexer;
- while(lexer->yylex() != 0)
- ;
- return 0;
- }
-
- IMPORTANT: the present form of the scanning class is
- experimental and may change considerably between major
- releases.
-
- INCOMPATIBILITIES WITH LEX AND POSIX
- flex is a rewrite of the AT&T Unix lex tool (the two
- implementations do not share any code, though), with some
- extensions and incompatibilities, both of which are of
- concern to those who wish to write scanners acceptable to
- either implementation. The POSIX lex specification is
- closer to flex's behavior than that of the original lex
- implementation, but there also remain some incompatibilities
- between flex and POSIX. The intent is that ultimately
- flex will be fully POSIX-conformant. In this section
- we discuss all of the known areas of incompatibility.
-
- flex's -l option turns on maximum compatibility with the
- original AT&T lex implementation, at the cost of a major
- loss in the generated scanner's performance. We note
- below which incompatibilities can be overcome using the -l
- option.
-
- flex is fully compatible with lex with the following
- exceptions:
-
- - The undocumented lex scanner internal variable
- yylineno is not supported unless -l is used.
-
- yylineno is not part of the POSIX specification.
-
- - The input() routine is not redefinable, though it
- may be called to read characters following whatever
- has been matched by a rule. If input() encounters
- an end-of-file the normal yywrap() processing is
- done. A "real" end-of-file is returned by
- input() as EOF.
-
- Input is instead controlled by defining the
- YY_INPUT macro.
-
- The flex restriction that input() cannot be redefined
- is in accordance with the POSIX specification,
- which simply does not specify any way of controlling
- the scanner's input other than by making
- an initial assignment to yyin.
-
- - flex scanners are not as reentrant as lex scanners.
- In particular, if you have an interactive scanner
- and an interrupt handler which long-jumps out of
- the scanner, and the scanner is subsequently called
- again, you may get the following message:
-
- fatal flex scanner internal error--end of buffer missed
-
- To reenter the scanner, first use
-
- yyrestart( yyin );
-
- Note that this call will throw away any buffered
- input; usually this isn't a problem with an inter-
- active scanner.
-
- Also note that flex C++ scanner classes are reentrant,
- so if using C++ is an option for you, you
- should use them instead. See "Generating C++ Scanners"
- above for details.
-
- - output() is not supported. Output from the ECHO
- macro is done to the file-pointer yyout (default
- stdout).
-
- output() is not part of the POSIX specification.
-
- - lex does not support exclusive start conditions
- (%x), though they are in the POSIX specification.
-
- - When definitions are expanded, flex encloses them
- in parentheses. With lex, the following:
-
- NAME [A-Z][A-Z0-9]*
- %%
- foo{NAME}? printf( "Found it\n" );
- %%
-
- will not match the string "foo" because when the
- macro is expanded the rule is equivalent to
- "foo[A-Z][A-Z0-9]*?" and the precedence is such that the
- '?' is associated with "[A-Z0-9]*". With flex, the
- rule will be expanded to "foo([A-Z][A-Z0-9]*)?" and
- so the string "foo" will match.
-
- Note that if the definition begins with ^ or ends
- with $ then it is not expanded with parentheses, to
- allow these operators to appear in definitions
- without losing their special meanings. But the
- <s>, /, and <<EOF>> operators cannot be used in a
- flex definition.
-
- Using -l results in the lex behavior of no parentheses
- around the definition.
-
- The POSIX specification is that the definition be
- enclosed in parentheses.
-
- - The lex %r (generate a Ratfor scanner) option is
- not supported. It is not part of the POSIX
- specification.
-
- - After a call to unput(), yytext and yyleng are
- undefined until the next token is matched, unless
- the scanner was built using %array. This is not
- the case with lex or the POSIX specification. The
- -l option does away with this incompatibility.
-
- - The precedence of the {} (numeric range) operator
- is different. lex interprets "abc{1,3}" as "match
- one, two, or three occurrences of 'abc'", whereas
- flex interprets it as "match 'ab' followed by one,
- two, or three occurrences of 'c'". The latter is
- in agreement with the POSIX specification.
-
- - The precedence of the ^ operator is different. lex
- interprets "^foo|bar" as "match either 'foo' at the
- beginning of a line, or 'bar' anywhere", whereas
- flex interprets it as "match either 'foo' or 'bar'
- if they come at the beginning of a line". The latter
- is in agreement with the POSIX specification.
-
- - yyin is initialized by lex to be stdin; flex, on
- the other hand, initializes yyin to NULL and then
- assigns it to stdin the first time the scanner is
- called, providing yyin has not already been
- assigned to a non-NULL value. The difference is
- subtle, but the net effect is that with flex scanners,
- yyin does not have a valid value until the
- scanner has been called.
-
- The -l option does away with this incompatibility.
-
- - The special table-size declarations such as %a
- supported by lex are not required by flex scanners;
- flex ignores them.
-
- - The name FLEX_SCANNER is #define'd so scanners may
- be written for use with either flex or lex.
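-
- The three precedence differences listed above (parenthesized
- definitions, the {} operator, and the ^ operator) can be
- checked outside of flex. The following is a minimal C
- sketch, not part of flex, which assumes a POSIX system that
- provides <regex.h>. The function ere_search() and the
- readings table are hypothetical names introduced here. Each
- entry spells out with explicit grouping how lex and flex
- read the same rule, so the patterns are stand-ins for the
- two interpretations rather than actual lex or flex input.
-
```c
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if the POSIX extended regular expression `pattern` matches
 * somewhere in `text`, 0 if it does not, and -1 if the pattern fails
 * to compile.  Patterns carry their own ^ and $ anchors.
 */
int ere_search(const char *pattern, const char *text)
{
    regex_t re;
    int     rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* One rule as written, plus the grouping each tool is said to apply. */
struct reading {
    const char *rule;   /* the rule as it appears in the scanner input */
    const char *lex;    /* explicit grouping for the lex reading */
    const char *flex;   /* explicit grouping for the flex reading */
};

const struct reading readings[] = {
    /* foo{NAME}? with NAME defined as [A-Z][A-Z0-9]* */
    { "foo{NAME}?", "^foo[A-Z]([A-Z0-9]*)?$", "^foo([A-Z][A-Z0-9]*)?$" },
    /* numeric range operator */
    { "abc{1,3}",   "^(abc){1,3}$",           "^ab(c){1,3}$" },
    /* beginning-of-line operator */
    { "^foo|bar",   "(^foo)|(bar)",           "^(foo|bar)" },
};
```
-
- With these patterns, "foo" is accepted only by the flex
- reading of foo{NAME}?, "abcabc" only by the lex reading of
- abc{1,3}, and "xbar" only by the lex reading of ^foo|bar.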
-
- The following flex features are not included in lex or the
- POSIX specification:
-
- yyterminate()
- <<EOF>>
- <*>
- YY_DECL
- YY_START
- YY_USER_ACTION
- #line directives
- %{}'s around actions
- multiple actions on a line
-
- plus almost all of the flex flags. The last feature in
- the list refers to the fact that with flex you can put
- multiple actions on the same line, separated with
- semicolons, while with lex, the following
-
- foo handle_foo(); ++num_foos_seen;
-
- is (rather surprisingly) truncated to
-
- foo handle_foo();
-
- flex does not truncate the action. Actions that are not
- enclosed in braces are simply terminated at the end of the
- line.
-
- DIAGNOSTICS
- warning, rule cannot be matched indicates that the given
- rule cannot be matched because it follows other rules that
- will always match the same text as it. For example, in
- the following "foo" cannot be matched because it comes
- after an identifier "catch-all" rule:
-
- [a-z]+ got_identifier();
- foo got_foo();
-
- Using REJECT in a scanner suppresses this warning.
-
- warning, -s option given but default rule can be matched
- means that it is possible (perhaps only in a particular
- start condition) that the default rule (match any single
- character) is the only one that will match a particular
- input. Since -s was given, presumably this is not
- intended.
-
- reject_used_but_not_detected undefined or
- yymore_used_but_not_detected undefined - These errors can
- occur at compile time. They indicate that the scanner
- uses REJECT or yymore() but that flex failed to notice the
- fact, meaning that flex scanned the first two sections
- looking for occurrences of these actions and failed to
- find any, but somehow you snuck some in (via a #include
- file, for example). Make an explicit reference to the
- action in your flex input file. (Note that previously
- flex supported a %used/%unused mechanism for dealing with
- this problem; this feature is still supported but now
- deprecated, and will go away soon unless the author hears
- from people who can argue compellingly that they need it.)
-
- flex scanner jammed - a scanner compiled with -s has
- encountered an input string which wasn't matched by any of
- its rules. This error can also occur due to internal
- problems.
-
- token too large, exceeds YYLMAX - your scanner uses %array
- and one of its rules matched a string longer than the
- YYLMAX constant (8K bytes by default). You can increase the
- value by #define'ing YYLMAX in the definitions section of
- your flex input.
-
- scanner requires -8 flag to use the character 'x' - Your
- scanner specification includes recognizing the 8-bit
- character 'x' and you did not specify the -8 flag, and your
- scanner defaulted to 7-bit because you used the -Cf or -CF
- table compression options. See the discussion of the -7
- flag for details.
-
- flex scanner push-back overflow - you used unput() to push
- back so much text that the scanner's buffer could not hold
- both the pushed-back text and the current token in yytext.
- Ideally the scanner should dynamically resize the buffer
- in this case, but at present it does not.
-
- input buffer overflow, can't enlarge buffer because scanner
- uses REJECT - the scanner was working on matching an
- extremely large token and needed to expand the input
- buffer. This doesn't work with scanners that use REJECT.
-
- fatal flex scanner internal error--end of buffer missed -
- This can occur in a scanner which is reentered after a
- long-jump has jumped out of (or over) the scanner's
- activation frame. Before reentering the scanner, use:
-
- yyrestart( yyin );
-
- or, as noted above, switch to using the C++ scanner class.
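-
- The recovery sequence can be sketched as follows. This is
- only a model of the control flow: model_yylex(),
- model_yyrestart() and model_yyin are hypothetical stand-ins
- for yylex(), yyrestart() and yyin so that the fragment
- compiles without a generated scanner, and the interrupt is
- simulated by calling longjmp() directly from the stand-in
- scanner rather than from a real signal handler.
-
```c
#include <setjmp.h>
#include <stdio.h>

/* Hypothetical stand-ins for the generated scanner's yyin, yyrestart()
 * and yylex(); they exist only so this sketch is self-contained. */
FILE   *model_yyin;
int     model_restarts;   /* number of times model_yyrestart() ran */
int     model_calls;      /* number of times model_yylex() ran */
jmp_buf recover;          /* where the simulated interrupt lands */

void model_yyrestart(FILE *f)
{
    model_yyin = f;
    ++model_restarts;
}

/* Pretends to scan.  The second call simulates an interrupt handler
 * long-jumping out of the scanner; the fourth call reports end of input. */
int model_yylex(void)
{
    ++model_calls;
    if (model_calls == 2)
        longjmp(recover, 1);
    return model_calls < 4 ? 1 : 0;
}

/* Scan all tokens, resetting the scanner whenever it was abandoned. */
int scan_with_recovery(FILE *in)
{
    /* volatile: modified between setjmp() and longjmp() */
    volatile int tokens = 0;

    model_yyin = in;
    if (setjmp(recover) != 0) {
        /* Arrived here via longjmp(): the scanner was left in
         * mid-scan, so restart it before calling it again. */
        model_yyrestart(model_yyin);
    }
    while (model_yylex() != 0)
        ++tokens;
    return tokens;
}
```
-
- The point of the pattern is that the restart call sits on
- the path taken after the long-jump and nowhere else, so
- buffered input is discarded only when the scanner really was
- abandoned.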
-
- too many start conditions in <> construct! - you listed
- more start conditions in a <> construct than exist (so you
- must have listed at least one of them twice).
-
- FILES
- See flex(1).
-
- DEFICIENCIES / BUGS
- Again, see flex(1).
-
- SEE ALSO
- flex(1), lex(1), yacc(1), sed(1), awk(1).
-
- M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer
- Generator
-
- AUTHOR
- Vern Paxson, with the help of many ideas and much inspira-
- tion from Van Jacobson. Original version by Jef
- Poskanzer. The fast table representation is a partial
- implementation of a design done by Van Jacobson. The
- implementation was done by Kevin Gong and Vern Paxson.
-
- Thanks to the many flex beta-testers, feedbackers, and
- contributors, especially Francois Pinard, Casey Leedom,
- Nelson H.F. Beebe, benson@odi.com, Peter A. Bigot, Keith
- Bostic, Frederic Brehm, Nick Christopher, Jason Coughlin,
- Bill Cox, Dave Curtis, Scott David Daniels, Chris G.
- Demetriou, Mike Donahue, Chuck Doucette, Tom Epperly, Leo
- Eskin, Chris Faylor, Jon Forrest, Kaveh R. Ghazi, Eric
- Goldman, Ulrich Grepel, Jan Hajic, Jarkko Hietaniemi, Eric
- Hughes, John Interrante, Ceriel Jacobs, Jeffrey R. Jones,
- Henry Juengst, Amir Katz, ken@ken.hilco.com, Kevin B.
- Kenny, Marq Kole, Ronald Lamprecht, Greg Lee, Craig Leres,
- John Levine, Steve Liddle, Mohamed el Lozy, Brian Madsen,
- Chris Metcalf, Luke Mewburn, Jim Meyering, G.T. Nicol,
- Landon Noll, Marc Nozell, Richard Ohnemus, Sven Panne,
- Roland Pesch, Walter Pelissero, Gaumond Pierre, Esmond
- Pitt, Jef Poskanzer, Joe Rahmeh, Frederic Raimbault, Rick
- Richardson, Kevin Rodgers, Jim Roskind, Doug Schmidt,
- Philippe Schnoebelen, Andreas Schwab, Alex Siegel, Mike
- Stump, Paul Stuart, Dave Tallman, Chris Thewalt, Paul
- Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent
- Williams, Ken Yap, Nathan Zelle, David Zuhn, and those
- whose names have slipped my marginal mail-archiving skills
- but whose contributions are appreciated all the same.
-
- Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John
- Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
- Nicol, Francois Pinard, Rich Salz, and Richard Stallman
- for help with various distribution headaches.
-
- Thanks to Esmond Pitt and Earle Horton for 8-bit character
- support; to Benson Margulies and Fred Burke for C++ sup-
- port; to Kent Williams and Tom Epperly for C++ class sup-
- port; to Ove Ewerlid for support of NUL's; and to Eric
- Hughes for support of multiple buffers.
-
- This work was primarily done when I was with the Real Time
- Systems Group at the Lawrence Berkeley Laboratory in
- Berkeley, CA. Many thanks to all there for the support I
- received.
-
- Send comments to:
-
- Vern Paxson
- Systems Engineering
- Bldg. 46A, Room 1123
- Lawrence Berkeley Laboratory
- University of California
- Berkeley, CA 94720
-
- vern@ee.lbl.gov
-
-
-
-
-
-
-
-
-
- Version 2.4 November 1993 42
-
-
-